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1-H Abstract 

^ One popular approach for nonstructural economic and financial forecasting is to include a 

f5 large number of economic and financial variables, which has been shown to lead to significant 



improvements for forecasting, for example, by the dynamic factor models. A challenging issue 



^ is to determine which variables and (their) lags are relevant, especially when there is a mixture 



of serial correlation (temporal dynamics) , high dimensional (spatial) dependence structure and 
moderate sample size (relative to dimensionality and lags). To this end, an integrated solution 
that addresses these three challenges simultaneously is appealing. We study the large vector 
^ auto regressions here with three types of estimates. We treat each variable's own lags different 

from other variables' lags, distinguish various lags over time, and is able to select the variables 
^ and lags simultaneously. We first show the consequences of using Lasso type estimate directly for 

^ time series without considering the temporal dependence. In contrast, our proposed method can 

^ still produce an estimate as efficient as an oracle under such scenarios. The tuning parameters 

are chosen via a data driven "rolling scheme" method to optimize the forecasting performance. 
^— I A macrocconomic and financial forecasting problem is considered to illustrate its superiority over 

existing estimators. 

Keywords: Time Series, Vector Auto Regression, Regularization, Lasso, Group Lasso, Oracle 
estimator 
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1 Introduction 

Macroeconomic forecasting is one of the central tasks in Economics. Broadly speaking, there are two 

approaches, structural and nonstructural forecasting. Structural forecasting, which aligns itself with 
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economic theory, and hence rises and falls with that, recedes following the decline of Keynesian theory. 
In recent years, new dynamic stochastic general equilibrium theory has been developed, and structural 
macroeconomic forecasting is poised for resurgence. Nonstructural forecasting, in contrast, attempts 
to exploit the reduced-form correlations in observed macroeconomic time series, has little reliance 
on economic theory, has always been working well and continues to be improved. Various univariate 
and multivariate time series analyzing techniques have been proposed, e.g. the auto regression (AR), 
moving average (MA), autoregressive moving average (ARMA), generalized autoregressive conditional 
heteroskedasticity (GARCH), vector auto regression (VAR) models among many others. A very 
challenging issue for this nonstructural approach is to determine which variables and (their) lags are 
relevant. If we omit some "important" variables by mistake, it potentially creates an omitted variable 
bias with adverse consequences for both structural analysis and forecasting. For example, [Christiano 



et al. ( 1999 ) points out that the positive reaction of prices in response to a monetary tightening, the 



so-called price puzzle, is an artefact resulting from the omission of forward-looking variables, such as 



the commodity price index. Recently, Bahbura et al. (2010) shows that, when using the cross-sectional 
dimension related shrinkage, the forecasting performance of small monetary vector auto regression can 
be improved by adding additional macroeconomic variables and sectoral information. To illustrate 
this, we consider an example of interest rate forecasting. Nowadays people primarily use univariate 
or multivariate time series models, e.g. the Vasicek, CIR, Jump-Diffusion, Regime-Switching, and 
time-varying coefficients models, all of which are mostly based on the information from the interest 
rate time series itself. However, in practice, the central bank (Fed) bases their decisions of interest 
rate adjustment (as a monetary policy instrument) heavily on the national macroeconomic situation 
by taking many macro and financial measures into account. Bringing in this additional spatial (over 
the space of variables instead of from a geographic point of view; also used in future for convenience) 
information will therefore help improve its forecasting performance. Another example about the 
interactions between macroeconomics and finance comes from modeling credit defaults by also using 
macroeconomic information, since variation in aggregate default rates over time presumably refiects 



changes in general economic conditions also. Figlewski et al. (2006) find credit events are significantly 
affected by macroeconomic factors. Not only macroeconomics could affect finance, finance could also 
affect macroeconomics. For example, the economic crisis typically starts from the stock market crash. 
All of these call for an integrated analysis of macroeconomics and finance. Thus recently there has 
been a growing trend of using large panel macroeconomic and financial time series for forecasting. 



impulse response study and structural analysis, Forni et al. (2000), Stock and Watson (2002a), Stock 



and Watson (2002b), also seen at Forni et al. (2005), Stock and Watson (2005b), Giannone et al. 
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(2005), and Banbura et al. (2010) for latest advancements. 

Besides its presence in empirical macroeconomics, high dimensional data, where information often 
scatters through a large number of interrelated time series, is also attracting increasing attention in 
many other fields of economics and finance. In neuro-economics and behavioral finance, one uses 
high dimensional functional magnetic resonance imaging data (fMRI) to analyze the brain's response 



to certain risk related stimuli as well as identifying its activation area, Worsley et al. (2002) and 



Mysickova et al. (2011). In quantitative finance, one studies the dynamics of the implied volatility 



surface for risk management, calibration and pricing purposes, Fengler et al. (2007). Other examples 



and research fields for very large dimensional time series include mortality analysis, Lee and Carter 



(1992); bond portfolio risk management or derivative pricing. Nelson and Siegel (1987) and Diebold 



and Li (2006); international economy (many countries); industrial economy (many firms); quantitative 



finance (many assets) analysis among many others. 

On the methodology side, if people still use either low dimensional (multivariate) time series 
techniques on a few subjectively (or from some background knowledge) selected variables or high 
dimensional "static" methods which are initially designed for independent data, they might either 
disregard potentially relevant information (temporal dynamics and spatial dependence) to produce 
suboptimal forecasts, or bring in additional risk. Examples include the already mentioned prize puzzle 
and interest rate forecasting problems. The more scattered and dynamic the information is, the severer 
this loss becomes. This modeling becomes more challenging under the situation that macroeconomic 
data we typically deal with has only low frequencies, e.g. monthly or yearly. For example, the pop- 



ularly used dataset introduced by Stock and Watson (2005a) contains 131 monthly macro indicators 
covering a broad range of categories including income, industrial production, capacity, employment 
and unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders, 
stock prices, interest rates for different maturities, exchange rates and money aggregates and so on. 
The time span is from January 1959 to December 2003 (so T = 540). In summary, we can see that the 
challenge of modeling high dimensional time series, especially the macroeconomic ones, comes from 
a mixture of serial correlation (temporal dynamics), high dimensional (spatial) dependence structure 
and moderate sample size (relative to dimensionality and lags). To this end, an integrated solution 
addressing these three challenges simultaneously is appealing. 

To circumvent this problem, dynamic factor models have been considered to be quite successful 



recently in the analysis of large panels of time series data, Forni et al. (2000), Stock and Watson 



(2002a), Stock and Watson (2002b), also seen at Forni et al. (2005), Giannone et al. (2005), Park 



et al. (2009) and Song et al. (2010) (nonstationary case). They rely on the assumption that the bulk 
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of dynamics interrelations within a large dataset can be explained and represented by a few common 
factors (low dimensional time series). Less general models in the literature include static factor models 
proposed by Stock and Watson ( 2002a[ ), Stock and Watson (2002b) and exact factor model suggested 



by Sargent and Sims (1977) and Geweke (1977). 

Compared with the well studied dynamic factor models through the use of dynamic principal 
component analysis, the vector auto regressive (VAR) models have several natural advantages. For 
example, compared with the dynamic factor models' typical 2-step estimation procedure: dimension 
reduction first and low dimensional time series modeling, the VAR approach is able to model the high 
dimensional time series in one step, which may lead to greater efficiency. It also allows variable-to- 
variable relationship (impulse response) analysis and facilitates corresponding interpretation, which 
is not feasible in the factor modeling setup since the variables are "represented" by the corresponding 
factors. Historically, the VAR models are not appropriate for analyzing high dimensional time series 
because they involve the estimation of too many {J^P, where J is the dimensionality and P is the 
number of lags) parameters. Thus they are primarily implemented on relatively low dimensional 



situations, e.g. the Baysian VARs (BVAR) by Doan et al. (1984) or still through the idea of factor 



modeling, e.g. the factor-augmented VAR (FAVAR) by Bernanke et al. (2005). However, based on 



recent advances in variable selection, shrinkage and regularization theory from Tibshirani (1996), Zou 



(2006) and Yuan and Lin (2006), large unrestricted vector auto regression becomes an alternative 
for the analysis of large dynamic systems. Therefore, the VAR framework can also be applied to 



empirical problems that require the analysis of more than a handful of time series. Mol et al. (2008) 



and Bahbura et al. (2010) proceed that from the Bayesian point of view. Chudik and Pesaran (2007) 
consider the case that P = 1 and both J and T are large through some "neighboring" procedure, 
which can be viewed as a special case of the "segmentized grouping" as we study here (details in 



Subsection 2.4). In the univariate case {J = 1), Wang et al. (2007) studies the regression coefficient 
and autoregressive order shrinkage and selection via the lasso when P is large and T is small (relative 
to P). 

In this article, we will study the large vector auto regressions when J,P-^oo and T is moderate 
(relative to JP). Comparing to prior works in (large) vector auto regressions, the novelty of this 
article lies in the following perspectives. First, from the variable selection and regularization point of 
view, the theoretical properties of many existing methods have been established under the indepen- 
dent scenario, which is rarely met in practice and contradicts the original time series setup (if used 
directly). Disregarding the serial correlation in variable selection and regularization can be dangerous 
in the sense that various risk bounds in fact depend on the degree of time dependence, as we will 
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illustrate later. We propose a new methodology to address this serial correlation (time dependence) 
issue together with high dimensionality and moderate sample size, which enables us to obtain the 
consistency of variable selection even under the dependent scenario, i.e. to reveal the equilibrium 
among them. Second, our method is able to do variable selection and lag selection simultaneously. In 
previous literature, variable selection is usually carried out first, and then the corresponding estimate's 
performances w.r.t. different number of lags are compared through some information criteria to select 
the "optimal" number of lags. By doing so, we neglect the "interaction" between variable selection 
and lag selection. Additionally, when the number of the lag's candidates to be searched over is large, 
it is also computationally inefficient, due to the cost of the repeated variable selection procedures. 
Third, we differentiate the variable of interest's own lags (abbreviated as own lags afterwards) from 
the ones of other variables (abbreviated as others' lags afterwards). Their relative weights are also 
allowed to be varied when predicting different variables, while in other literature, they are assumed 
to stay the same. This is due to the fact that the dynamic of some variables is driven by itself, while 
for a different variable, it might be driven by the dynamics of others. When we include a vast number 
of macroeconomic and financial time series, assuming the same weight seems to be too restrictive. 
Fourth, our method is based on a more computationally efficient approach, which mostly uses the 



existing packages, e.g. the LARS (least angle regression) package, developed by Efron et al. (2004), 
while most other works in the literature go through the Bayesian approach that requires the choice 
of priors. 

The rest of the article is organized as follows. In the next section, we present the main ingredients 
of the large vector autoregressive model (Large VAR) together with several corresponding estimation 
procedures and comparisons among them. The estimates' properties are presented in Section [3j 
In Section |4], the method is applied to the motivating macroeconomic forecasting problem, which 
shows that it outperforms some major existing method. Section |5] contains concluding remarks with 
discussions about relaxing some assumptions. All technical proofs are sketched in the appendix. 



2 The Large VAR Model and Its Estimation 

In this section, we introduce the model with three different estimates first, then discuss the data driven 
choice of hyperparameters to optimize the forecasting performance, provide a numerical algorithm, 
and finally summarize comparisons among these three estimates. 
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2.1 The Model 



Assume that the high dimensional time series is generated from 
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where 



F = (yT,...,yT 



with y: ' 



• X = (X^, . . . , X7)T (the lags of Y) with Xt = {Y^\, F^Tp)^; 

• Bi, . . . , i?p are J x J autoregressive matrices, where P is the number of lags, B = (Bi, . . . , Bp)~^ 
is the JP X J matrix containing all coefficients {Bpij}p^^j^{^j^^, and B..j, Bp.j, B.i., Bpi. is the 
jth column of B and -Bp, ith row of -B and Bp respectively; 

• U = {U^ , . . . , Uj)~^ , where Ut is a J-dimensional noise and independent of Xt. 

All Yt,Xt and f/^ are assumed to have mean zero. The J x J covariance matrix of Ut, Cov(f/t), is 
assumed to be independent of t. Here we assume Cov(f/t) to be diagonal, say In our case, it is 
justified by the fact that the variables in the panel we will consider for estimation are standardized 



and demeaned. Similar assumption is also carried out in Mol et al. (2008). The relaxation allowing 
nonzero off-diagonal entries is discussed in Section |5} 

We can see that given large J and P, we have to estimate a total of J^P parameters, which is 
much larger than the moderate number of observations JT, i.e. JP ^ T. Consequently, ordinary 
least squares estimation is not feasible. Additionally, due to the structural change points in the macro 
and financial data (although not explored in this paper), the effective number of observations used for 
estimation could be much smaller than the original T. Thus we can see that on one hand, we do not 
want to impose any restrictions on the parameters and attain some general representations; on the 
other hand, it is known that making the model unnecessarily complex can degrade the efficiency of 
the resulting parameter estimate and yield less accurate predictions, as well as making interpretation 
and variable selection difficult. Hence, to avoid over fitting, regularization and variable selection are 
necessary. In the following, we are going to discuss the estimation procedure with different kinds of 



regularization (illustrated in Figure [T]). Before moving on, we incorporate the following very mild 
belief, as also considered in Banbura et al. (2010): the more recent lags should provide more reliable 
information than the more distant ones, which tries to strike a balance between attaining model 
simplicity and keeping the historic information. 
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v 



Figure 1: Illustration of three different types of estimates. 



2.2 Universal Grouping 



Without loss of generality, we start from considering one coefficient matrix, say Bp with entries 



{Bpij, 1 ^ i,j ^ J}. Inspired by Banbura et al. (2010), we note the fact that the dynamic of some 
variable is driven by itself, while for a different variable, it might be driven by the dynamics of 
others. Consequently, we treat the variables' own lags (diagonal terms of Bp) different from others' 
lags (off-diagonal terms of Bp) and impose different regularizations for them. We assume that the 
off-diagonal coefficients of Bp are not only sparse, but also have the same sparsity pattern across 
different columns, which we call group sparsity. Thus we base our selection solution on group Lasso 



techniques (Yuan and Lin (2006)) for the off-diagonal terms and Lasso techniques (Tibshirani (1996)) 



for the diagonal terms here. We use Bpj^j to denote the vector composed of {Bpji}i-^j and W-j to 
denote the (J — 1) x (J — 1) diagonal matrix diag[wi, . . . , Wj+i, • • • , wj] where Wi is the positive 
real-valued weight associated with the ith variable for 1 ^ z ^ J. It is included here primarily for 



practical implementation since if wt is chosen as the Std(Fi), it is equivalent (subsection 2.6 for details) 
to standardize the predictors so that they all have zero mean and unit variance, [Tibshirani (1996), 
which is also preferable for comparisons to prior works. 

Specifically, given the above notations, we use the group Lasso type penalty ll-Bpi-jW^-jlb and 



Lasso type penalty X]j=i ""^il-^piil ^'^ impose regularizations on other regressors' lags and predicted 
variables' own lags respectively and have the following penalty for the Bp matrix: 

J J 

|2 + /i^«;,|5p,-,| < (3) 



p]-j^-j\ 



with some generic constant C. 

• The hyperparameter // controls the extent to which others' lags are less (more) "important" 
than the own lags. When /i is large, the penalty assigned to own lags is larger than to others' 
lags. As a result, it is more likely that the off-diagonal entries are shrunlc to instead of the 
diagonal ones, which corresponds to the case that the variable's dynamic is driven by itself, and 
vice versa when /x is small. 

• The item reflects different regularization for different lags (over time). It becomes smaller 
when p gets larger. This is consistent with the previous belief: the more recent lags should 
provide more reliable information than the more distant ones. Thus as a result, large amounts of 
shrinkage are towards the more distant lags, whereas small amounts of shrinkage are towards the 
more recent ones. The hyperparameter a governs the relative importance of distant lags w.r.t. 
the more recent ones. Other decreasing functions of p, e.g. f{p) = log(p)~", /(p) = exp(p)~" 
could also be used. However, we do not consider a general representation (and use a data driven 
way to estimate /(I), . . . , f{p), • • • , f{P) correspondingly) to avoid too many tuning parameters, 
especially when P oo. 

Since we have P coefficient matrices Bi, . . . , Bp, summing ([s]) up over p (after multiplying p°' on both 
sides) yields X]p=i X]j=i P"ll-^pj-i^-j II2 + l^J2p=iJ2j=iP"'^j\Ppjj\ ^ If couple this to the 

quadratic loss {2J{T-P)}~^ Ef=p+i 11^/ -^7^112 through La grange multipliers, we have equation 

T P J P J 

nnn{J(T — P)} ^ ^..^ ^112 1 '^\/_^/_^p ii-^pj-j -j 11^ ' P' /_^p "^ji-^pjji 

t=p+i p=i j=i p=i j=i 



t=p+i p=i j=i p=i j=i 

T P J P J 

{j{T-p)}-' rr-^75ii^ + A$^$^pii5p,-,w_,ii2+7EE^'"^^-i^^^-^-i(4) 



T P J P J 

= mm 

B 

t=p+i p=i j=i p=i j=i 



with hyperparameters A, 7 and a. We call this estimate B the universal grouping estimate. As 
the number of variables J increases, the autocoefficients should be shrunk more in order to avoid 



over-fitting, as already discussed by Mol et al. (2008) 



Using the group Lasso type regularization for the off-diagonal terms actually poses some strong 
assumptions on the underlying structure, which is not realistic from an economic point of view. 
Remark 2.2.1 First, we just have one hyperparameter /i (/i = 7/A) to control the relative weights 
between own lags and others' lags. This means that the weights between own's lags and others' lags 
are the same across different dimensions which is hardly met in practice. Correspondingly, when 
we select the "optimal" /i to optimize the forecasting performance, we are actually optimizing the 
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averaged forecasting performance for all J variables instead of the variable of particular interest. This 



might produce suboptimal forecasts. Bahbura et al. (2010) considers a special case that own lags 



are always more "important" than others' lags, which might be less general than ours. Remark 
2.2.2 Second, using the L2 norm ||i?pj_jiy_j||2 might shrink all off-diagonal terms in the same row 
{{Bpj.}.-Lj) to zero simultaneously, which implicitly means that, for the jth corresponding variable, 
we assume it is either significant for all the other J — 1 variables or not for any other J — 1 variables 
at all. This is, again, too strong from an economic point of view. 

2.3 No Grouping 

To amend the deficiencies of the universal grouping estimate, we estimate the autocoefficient matrix 
B column by column instead of all at once. Without loss of generality, we consider the jth column 
B..j here. Since B..j is a vector, we can use the Lasso type penalties for both own lags and others' 



lags. By following similar ideas and abbreviations in subsection 2.2, we have equation ([s]) to get B..f 



T f f 



mm 

B 

t=p+l p=l i^j p=l 



T P P 

min(T - Py' (^*^- - ^^B-^)" + E E^^^^I^P^^I + 7.Ep>I^p..I (5) 
■ t=p+i p=i ij^j p=i 

with hyperparameters Xj, and a. The subindex j is added to Xj and 7^ to emphasize that they 
could vary when estimating different -B-./s, 1 ^ j ^ J. We call this estimate B = {B..i, . . . , B..j) the 
no grouping estimate. 

Remark 2.3.1 Because of different /x/s {fij = ^j/Xj) for different columns' estimates B..j, we 
allow individualized weights between own lags and others' lags and could tune Aj's and 7j's to produce 
optimal forecasting performance for each variable of interest, say the jth. Remark 2.3.2 Also for 
the same reason, we could get rid of the disadvantage that all off-diagonal terms in one row might be 
shrunk to simultaneously. 

For simplicity of notation, we drop the common subindex j and write Ytj = yt, B..j = /3, Bpij = 
Ci,i ^ j, Bpjj = dp, Xj = A, 7j = 7, and ([s]) becomes: 

T P P 

mrnQriP) = min(T - P)-^ (y, - X^/?)^ + A + 7$^p"ti;p|cip| 

t=p+i p=i i^j p=i 

T ^'(^-l) P 

= min(T-P)-i ^ (^,_X7/3)^ + A. ^ |c,|+7,^|rfp| (6) 

t=P+l i=l p=l 

with Aj = Xp"Wi and 7^ = •yp^'Wp. 



2.4 Segmentized Grouping 

Since the large panel of macroeconomic and financial data sets usually have some natural "segment" 
structure, e.g. multiple interest rate time series w.r.t. different maturities, different number of em- 
ployees w.r.t. different industrial sectors, different price indices w.r.t. different goods etc, if we take 
this information into account instead of estimating B either all at once or column by column, we 
could also do it segment by segment. Without loss of generality, we consider the ith segment B..j^^. 
Mi is the index set for the zth segment and Ni = \Afi\ denotes the cardinality of the set A/^. Yjj-, is 
the corresponding part of . We also use to denote the Ni x Ni diagonal matrix with diagonal 
entries {wi}i^j'^. and Wj\f^-j to denote the {Ni — 1) x (Ni — 1) diagonal matrix with diagonal entries 

Under this situation, we have: own lags, others' (in the same segment) lags and others' (outside the 
segment) lags for the estimation of the ith segment's corresponding autoregressive coefficients B..j^^. 
Also following similar ideas and abbreviations in subsection 2.2[ we have the following estimation 
equation: 

mms..,^{N,{T - P)}-' EIp+i " ^7 B..^Ml 

mmB..^^{N,iT-P)}-'Zlp+i \\Y,l^ - Xj B..^£, 
Ep=i Ej^A/-,P"II^Pi-WVj|2 + J2p=iP'"Wj\Bpjj\ + Ep=i EjGW.P"II^Pi-i^M-ill2 (7) 



with hyperparameters X^f,,l^f^, ce, 7Ar, = and rjj^^ = \m,I^2M,, ^ = 1, . . . , / where I is the 

overall number of segments. We call this estimate B = {B..j^^, . . . , B-Mj) the segmentized grouping 
estimate. 



Chudik and Pesaran (2007) consider the case P = 1 and T is large (relative to J) through some 
"neighboring" procedure, which can be viewed as a special case of the "segmentized grouping" we 
studied here. 



2.5 Forecast Evaluation and Choice of Parameters 

The three penalization methods discussed above critically depend on penalty parameter selection for 
their performance in model selection, parameter estimation and prediction accuracy. Here we have 
hyper-parameters A, 7 (universal grouping), Aj, 7^, 1 ^ j ^ J (universal grouping), Aj, 7^, rji,! ^ i ^ I 
and a, and choose them via a data driven "rolling scheme". To simulate real-time forecasting, 
we conduct an out-of-sample experiment. Let To and Ti denote the beginning and the end of the 
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Figure 2: Illustration of the Rolling Scheme 

evaluation sample respectively. The point estimate of the jth variable's forecast is denoted by 
based on cr{t), the information up to time t. The point estimate of the one-step-ahead forecast is 
computed as in equation ([s]), and the /i-step-ahead forecasts are computed in similar spirit. Out-of- 
sample forecast accuracy is measured in terms of mean squared forecast error (MSFE): 

t=To 

We report results for MSFE relative to the benchmark (random walk with drift) model's (abbreviated 



as MSFEfl), as also considered by Bahbura et al. (|2010|), i.e. 



RMSFE 



MSFEfl ■ 

The parameters are estimated using the observations from the most recent 10 years (rolling scheme) 
as illustrated in Figure [2] The parameters are set to yield a desired fit for the variable (s) of inter- 
est from To to Ti. In other words, to obtain the desired magnitude of fit, the search is performed 
over a grid of A, 7 and a to minimize ^/^i RMSFEf^^'''^ (universal grouping); Aj,7j and a to 
minimize RMSFE^'^il^'"'^ (no grouping); Xi,'yi,rii and a to minimize XljeA/; RM SFE^^^'^'"^ (segmen- 
tized grouping) respectively. Due to computational cost, we prefix a to be 1 or 2 first, and then 
do the search of A's and 7's over loose grids. For the nice performing A's and 7's, we search over 
denser grids around them afterwards. The parfor command in Matlab is used to facilitate parallel 
computations to fasten this process. Also, using the least angle regression package provided at www- 
stat.stanford.edu/~tibs/glmnet-matlab, makes the computation time together with the parameter 
selection very moderate in our experience. 

2.6 Algorithm 



Motivated by the adaptive lasso procedure, Zou (2006), if we define 
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• V = diag[l°, 2", . . . , P"] (g) Ijxj, where diag[l°, 2", . . . , P'^] is the diagonal matrix with diagonal 
entries 2~", . . . , P~°'}, ® is the Kronecker product and Ijxj is the J x J identity matrix; 

• W = Ipxp ^ dmg[wi,W2, . . . ,wj]; 

• = X^W-^V-^ and B = VWB 

and note the fact that X^ B in ^ is the same as X'^W^^V^^VyVB = X^ B, we have the following 
estimation procedure (the proof is very simple and hence is omitted): 

(1) Generate X^ = X^W-^p-^; 

(2) Corresponding to the three different estimates Q, ([s]) and ([T]), solve: 



mm 

B 



mm 

B. 



t=p+i p=i j=i p=i j=i 

T P P 

(T - P)-i (^*^- - ^^B.,f + \,Y.Y. \^v^^ + 7,E \Bpnl (9) 
t=p+l p=l i^tj p=l 

T P P 

min{iV,(T-P)}-i r,i-X75..^,J|^ + AA.,5^5^||5p,.||2 + 7MEl^^'^-^-| 

t=P+l p=l j^Ni p=l 

P 

+ XI 5Z (10) 



(3) Output -B = W^^V^^B with 5 minimizing ^ (universal grouping), B = {B..i, . . . , B..j), 
B..j minimizing ^ (no grouping); B = (i?..^^, . . . , B..j^- minimizing (10) (segmentized 

grouping). 



At Step (2), motivated by Wang et al. (2007), as we have more than one penalty terms (mixed 
Lasso and group Lasso), we could iterate between penalties to solve it as the standard (group) Lasso 
problem. 

For the "no grouping" estimate, by noting that •jj = and 7j^p=i \Bpjj\ = ^j^p=i \fJ'jBpjj\, 
the estimation procedure above is equivalent to: 

(1) Generate X""" = X'^yV'~^V"^ with W' = IpxP ® diag[wi,W2,Wj_i,UjWj,Wj+i, . . . ,wj] for esti- 
mating B..j, 1 ^ j ^ J; 

(2) Corresponding to ([9]), solve: 

T P J 

iin(T - P)-i Y (^*^- - ^t"^-^)' + E E \^P^^\-^ (11) 



mm( 

B.. 



t=P+l p=l i=l 
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(3) Output B = W ^V'^B with B = {B..i, B..j), B..j minimizing Q (no grouping). 



At Step (2), we could avoid iterating between multiple penalties and just solve it as the standard Lasso, 
e.g. by using the least angle regression package provided at www-stat.stanford.edu/~tibs/glmnet- 
matlab. 

In "large J, small T" paradigms, to get parsimonious models, shrinkage with penalization in 
model selection can shrink insignificant regression coefficients towards zero exactly, but at the same 
time, significant coefficients are shrunk as well though they are retained in selected working models. 



Wainwright (2009) and Huang et al. (2008). To this end, we only use our method for the variable 



(and lag) selection, but not for estimation, Chernozhukov et al. (2011). Thus, we implement the 



ordinary least squares estimation for the selected variables (and lags) from ([8]), (|9]) and (10) w.r.t. 
three different estimates. 



2.7 Comparison 

Now it is a matter of what kind of regularization techniques among these three choices to use in 
practice. First, as already discussed in Remark 2.2.1 and 2.3.1, from the allowing individualized 
(for the variable of particular interest) weights between own lags and others' lags and individualized 
forecasting performance optimization point of view, the "no grouping" approach is the best, the 
"universal grouping" one is the worst, and the "segmentized grouping" one is in between. Second, as 
in Remark 2.2.2 and 2.3.2, from whether all off-diagonal autocoefficients in one row are shrunk to zero 



point of view, the "no grouping" one is still favored. Third, as in subsection 2^, the tuning parameters 
w.r.t. the universal grouping, no grouping or segmentized grouping estimates are selected to optimize 
the averaged forecasting performance for all variables, the speciffc variable's forecasting performance 
or the averaged forecasting performance for the variables in the same segment respectively. When 
different variables' time series have very distinct patterns, this individualized optimization is preferred. 
Fourth, for the estimation of large coefficient matrices, due to the strong group-sparse assumption on 



the underlying structure as mentioned in subsection 2.2, the group lasso type estimator actually has 



a sharper theoretical risk bound, Huang and Zhang (2009) for more details. In particular, they show 



that group Lasso is more robust to noise due to the stability associated with group structure and thus 
requires a smaller sample size to satisfy the sparse eigenvalue condition required in modern sparsity 
analysis. And the universal grouping estimate is also more computationally efficient since the whole 
autocoefficient matrix is estimated at once. However, note that the statistical error is a combination of 
modeling error and estimation error. Even though the group Lasso type estimate might have smaller 
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estimation error, due to the strong assumption to the underlying structure, the overall risk might 
not be smaller, as we discussed in subsection |2.2[ Moreover, the typical macroeconomic data has low 
frequency, i.e. monthly. Thus the computational cost is not a severe problem since we only need to 
update the model once per month at most. Due to all these, we suggest the no grouping estimate 
for practical implementation as a compromise between flexibility and realization of assumptions. For 
this reason and technical simplicities, we mainly study the theoretical properties of the no grouping 
estimate as defined in (161) afterwards. 



3 Estimates' Properties 



In this section, we first show that, under the time series setup, if we just use the classic Lasso 



estimator, the risk bound will depend on the time dependence level as in Theorem |3.1[ To circumvent 
this problem, through reweighting over time, our estimate in ^ can still produce an estimator, which 



is shown in Theorem 3.2 and 3.3, to be equivalent to an appropriate oracle. The techniques of the 



proofs are closely built upon those in Lounici et al. (2009), Bickel et al. (2009), Lounici (2008) and 



Wang et al. (2007). 



3.1 Dependence Matters? 

Now we will illustrate how the temporal dependence level affects the risk bounds of the Lasso type 
estimator. For technical simplicities, we consider the univariate AR(P) (or MA(P)) model with 
P — > oo, i.e. J = 1 for equations ([TJ and (|2]): 

et = XtiOi + . . . , xtpOp + et = xj6 + et, (12) 

with the regressors {xti, . . . , Xtp) = xj, the coefficients (^^i, . . . , 9p) = 9^ and the error term ej. We 
also define x as a T x P matrix with the t,pth entry as Xtp and e = (ei, . . . , eT)~^ ■ Xtp = et-p (or 
Ct-p) corresponds to the AR(P) (or MA(P)) model. In this situation, since there are no "others lags" 
(J = 1) and 6^ is a vector, the standard Lasso estimator 9 is defined through: 



T 

min(T-P)-i {et-xj9f + 2X\\9\\i. (13) 
t=p+i 



We assume there is a true coefficient 9* for ( [12| ) and define M{9*) = X]p=i l(^p 0) M{9) 



Ylp=i 7^ 0)- Before moving on, we recall the fractional cover theory based definition first, which 



was introduced by Janson (2004) and can be viewed as a generalization of m-dependency. Given a 



set T and random variables \4, t G T, we say: 
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• A subset T' of T is independent if the corresponding random variables {Vt}teT' ^^e independent. 

• A family {Tj}j of subsets of T is a ccwer of T if % = 

• A family {(7j, Wj)}j of pairs (TJ, wj), where Tj '^T and Wj e [0, 1] is a fractional cover of T if 

wjItt ^ Ir, i.e. Zl^-.teT^. wj ^ 1 for each t eT. 

• A (fractional) cover is proper if each set TJ in it is independent. 

• X{T) is the size of the smallest proper cover of T, i.e. the smallest m such that T is the union 
of m independent subsets. 

• X*{T) is the minimum of wj over all proper fractional covers {(7}, Wj)}j. 

Notice that, in spirit of these notations, X(J~) and X*{T) depend not only on T but also on the 
family {Vt}teT- Further note that X*{T) ^ 1 (unless T = 0) and that X*{T) = 1 if and only if the 
variables e T are independent, i.e. X*{T) is a measure of the dependence structure of {Vt}teT- 
For example, if Vt only depends on Vt-i, . . . , Vt-k but is independent of all {V^}s<t-ifc, we will have 
k + 1 independent sets: 

Ti = {Vi, V(fc+i)+i, V'2(fc+i)+i, . . .}, 

T2 = {V2, V(fc+i)+2, V2{k+l)+2, ■■■}, 
7k+l = {^k+l, V(fc+i)+(fc+i), V2(fe+l)+(fe+l), . . .}, 

s.t. Tj^r.So X*{T) = A; + 1 (if A; + 1< T). 

Before stating the first main result of this section, we make the following two assumptions. 

ASSUMPTION 3.1 With a high probability q, V p, the random variables Xtp and et satisfy 

T 

\etxtp\^btand T'^J^^^t^C 

t=i 

for some constants bt,C' > 0,t — 1, . . . ,T . 

ASSUMPTION 3.2 There exists a positive number k — k{s) such that 

minji^^ : \n\ ^ s, A G M^\{0}, || A-^c 3 || A^^ ||i } ^ k, 
'-VT|A7^|2 

where TZ'^ denotes the complement of the set of indices TZ, A-ji denotes the vector formed by the 
coordinates of the vector A w.r.t. the index set TZ. 
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Assumption 3.2 is the restricted eigenvalue assumption from Bickel et al. (2009), which is essentially 



a restriction on the eigenvalues of the Gram matrix \E't = x/T as a function of sparsity s. To see 



this, recall the definitions of restricted eigenvalues and restricted correlations in Bickel et al. (2009): 



mm 



u 



:l^M(z)^u \Z\2 



max 



l€z^P, 



liC z < P, 



m2 



max 



■f ' ' f 

>- J Jl 2 J2 2 



where denotes the cardinality of Jj and xi- is the T x |Jj| submatrix of x obtained by removing 



from X the columns that do not correspond to the indices in /j. Lemma 4.1 in Bickel et al. (2009) 



shows that if the restricted eigenvalue of the Gram matrix satisfies ?/'min(2s) > 3V's,2s for some 



integer 1 ^ s ^ -P/2, Assumption 3.2 holds. 
We can now state our first main result. 



THEOREM 3.1 Consider the model (12) for P ^ 3, T ^ 1 and random variables Vt = etXtp, t e r. 



Let the random variables Xtp and et satisfy Assumption 3.1 for any p, all diagonal elements of the 
matrix x~^x/T euqal to 1, and M{6*) ^ s. Furthermore, let k be defined as in Assumption 3.2, and 



dmnr bc thc maximum eigenvalue of the matrix X^X/T. Let X = ^X*{T){\og Py+^'C'/T, 6' > 0. 
Then with probability at least q{l — P^^'), for any solution 9 of (13), we have: 

5 ,,2 



T 



-1 1 



x( 



^ iQsx*{r)i\ogpy+^'c'/TK^, 
e-e* 111 ^ 16 ^ iQsy/x*{T){\ogpy+^'C'/T/K\ 



M{e) ^ uctta.s/>^. 



(14) 
(15) 
(16) 



Before explaining the results, we would like to discuss some related results first. Suppose x in (12) 



xe. 



has full rank P and is N(0,cr^). Consider the least squares estimate (P ^ T) Oqls = {^^'^ 
Then from standard least squares theory, we know that the prediction error \\x^{9ols ~ ^*)ll2/'^^ is 

J\x^{eoLs-0*)\\i 



Xp-distributed, i.e. 



a 
f 



-P. 



(17) 



In the sparse situation if et is N(0, a ) (different from our case). Corollary 6.2 of Biihlmann and van de 



Geer (2011) shows that the Lasso estimate obeys the following oracle inequality: 

11^""^^- -"-'ll^ « c/-^Mm (18) 
with a large probability and some constant Cq. The additional logP factor here could be seen as the 



price to pay for not knowing the set {6*, 6* ^ 0}, Donoho and Johnstone (1994). 
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Similar to the i.i.d. Gaussian situation discussed above, the term s(logP)^+'^' in (20) could be 
interpreted as the price to pay for not knowing the set {6*, 9* ^ 0}. Here we have (logP)^^^' instead 
of logP because we deviate from the typical i.i.d. Gaussian situation and establish the result under 



the more general Assumption 3.1, which could be thought as the "finite second moment" condition. 
And the b' term is the price to pay for this deviation. 

For the case of Xfp = ^t-p (MA(P) model), by the definition of X*{T) and Vj, if k* = max{p, s.t.9* ^ 
0, e^^-^ = e;^^ = ... = 9*p = 0}),we have X*{T) = k* + 1 (ii k* + 1 < T). The RHS of becomes 
16s(A;* + l)(logP)^+''7^«^^- For the case of Xtp = et-p (AR(P) model), which is equivalent to MA(oo) 
and X*{T) ^ T, the RHS of Q becomes 16s(log P)^+^7'«^- Thus X*{T) could be interpreted as a 
measure on how many past lags Vt depends on. Additionally, when the time dependence level increases, 
by the definition of the Gram matrix, k, will decrease since it characterizes how strong x^,. . . ,Xtp 
depend on each other. Still using the MA{k*) example considered above, n could be thought as a 
measure on how strong Cf depends on et-i, . . . , et-k*, which is a complement of the measure of X*(T) 



on the time dependence level. In both cases (MA(P) or AR(P)), Theorem 3.1 states that if we use 
the standard Lasso estimate directly for the time series, the bounds get larger when the dependence 
level {X*{T)) increases and k decreases. In other words, the bound is minimized when X*{T) = 1, 
which corresponds to the independent situation in the literature. When X*{T) reaches T, it will be 
offset by the T in the denominator. Thus the risk bound does not decrease when T increases. The 
intuition behind is clear: if the dependence level is strong, then the additional information brought 
by a "new" observation will be effectively less, i.e. the overall information from {V*}^^ will be less 
correspondingly, which will result in increasing estimates' risk bounds. Consequently, we expect the 
selection not to be stable and to be very sensitive to minor perturbation of the data. In this sense, we 
do not expect variable selection to provide results that lead to clearer economic interpretation than 
principal components or Ridge regression. 



3.2 Consistency of Selection 

To study the oracle properties of the estimator in (|6|, we assume that there is a correct model 
with the regression and autoregression coefficients (3* = {c*~^,d*~^)T = {cl, ... ,c*pf^j_^-^,dl, ...,(!}>)' . 
Furthermore, we assume that there are a total of po ^ P( J — 1) non-zero other-lag coefficients and 
go ^ P non-zero own-lag coefficients. For convenience, we define Si = {1 ^ i ^ P{J — l),c* 7^ 0}, 
Si = {l^i^ P{J - 1), Ci ^ 0}, ^2 = {1 ^ P ^ i^, 7^ 0} and = {1 ^ p ^ P, c/p ^ 0}. Then, the 
sets 5*1 and 5*2 contain the indices of the significant others-lag and own-lag coefficients respectively, 
and their complements and 6*2 contain the indices of the insignificant coefficients. Next, let 
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denote the ^ 1 significant other-lag coefficient vector with £5^ being its associated estimator. 
Moreover, other related parameters and their corresponding estimators are analogously defined (e.g. 
c^c, csi, d*g^, ds2, dgc, ds^)- Finally, let (31 = (c^'^, d*^J' and = (c^'c, dgc)' with corresponding estimates 
Pi, (32- To facilitate the study, we also introduce the notations 

ot == max(Ai,7p,z e Si,p e S2), 
br = mm{Xi,'yp,i e Sl,p e S2), 

where Aj and 7p are functions of T. To investigate the theoretical properties of /3, we introduce the 
following conditions: 

Al The sequence {Xt} is independent of et {et '= Utj); 

A2 All roots of polynomial 1 — J2p=i ^*p^p outside the unit circle; 

A3 Et has finite fourth-order moment, i.e. E{ef) < 00; 

A4 Vj, Ytj (as components of the covariate Xt) is strictly stationary and ergodic with finte second- 
order moment (i.e. E < 00). 

The technical conditions above are typically used to assure the -\/r-consistency and asymptotic nor- 
mality of the unpenalized least squares estimator. 
We also rewrite equation ^ as 

T P{J-^) P 

minQT(/3) = min {yt-Xj(3f + T ^ A,|c,| + T^^ 7p|dp| (19) 
t=p+i 1=1 p=i 

by multiplying 2(T — P) and writing T — P as T without confusion. Define Ylt=p+iiyt ~ -^7 f^Y 

Lt{(3), xJ = {Xj^t, X2j^t, • • • 5 Xjj^t), the own lags corresponding to the coefficients c, and zj = Xj\xJ , 

others' lags corresponding to the coefficients d respectively. Then Xj (3 = xjc + zjd. 

We first investigate the consistency of the estimator of 

LEMMA 3.1 Assume that ar = o{l) as T ^ 00. Then, under conditions (AI-A4), 3 a local 
minimizer (3 of ^ s.t. 

\\P-l3*\\=0,{T-^'^ + aT). 

The proof is given in the appendix. Lemma [XT] implies that, if the tuning parameters associated with 
the significant regressors converge to at a speed faster than T~^/^, then there is a local minimizer 
of ([6]), which is A/T-consistent. Next, we show that, if the tuning parameters associated with the 
insignificant regressors shrink to slower than T~^/^, then their coefficients can be estimated exactly 
as with probability tending to 1. 
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THEOREM 3.2 (Consistency of Selection) Assume thatbxVT oo and = 0{T-'^/^] 

then 

F02 = 0) ^ 1. 



Theorem |3.2| shows that our method can produce a sparse solution for insignificant coefficients con- 
indicates that the \/T-consistent 



3.1 



sistently. Furthermore, this theorem, together with Lemma 
estimator must satisfy P(/32 = 0) — )■ 1 when the tuning parameters fulfill the appropriate conditions. 
Finally, we obtain the asymptotic distribution of this estimator. 



THEOREM 3.3 Assume that a^yT — )■ and hx^T — )■ oo. Then, under conditions (AI-A4), the 



"nonzero" components (3i of the local minimizer /3 in Lemma 3.1 satisfies 



(/3i-/3*)VT AiV(0,S 



)^ 



P(5i = 5i) ^ 1, P(52 = S2) ^ 1, 
where Sq is the submatrix ofT, corresponding to f3l, S = diag(i?,C), B = E{xtxJ) and C = E{ztzJ). 



Theorem 



3.3 



implies that, if the tuning parameters satisfy the conditions axvT — )■ and bxVT — )■ 00, 
then, asymptotically, the resulting estimator can be as efficient as the oracle estimator. And our 
method can produce a sparse solution for significant coefficients consistently. 

Since consistency of selection is established here, if we use the ordinary least squares estimation 



for the selected variables, we can avoid the log term on (18) and (14). 



4 Application 



We use the dataset of Stock and Watson (2005a) for illustration. This dataset contains 131 monthly 



macro indicators covering a broad range of categories including income, industrial production, ca- 
pacity, employment and unemployment, consumer prices, producer prices, wages, housing starts, 
inventories and orders, stock prices, interest rates for different maturities, exchange rates, money 
aggregates and so on. The time span is from January 1959 to December 2003. We apply logarithms 
to most of the series except those already expressed in rates. The variables of special interest include 
a measure of real economic activity, a measure of prices and a monetary policy instrument. As in 



Christiano et al. (1999), we use employment as an indicator of real economic activity measured by the 



number of employees on non-farm payrolls (EMPL). The level of prices is measured by the consumer 
price index (CPl) and the monetary policy instrument is the Federal Funds Rate (FFR). All 131 
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P = 1 


P = 4 


P = 7 


P = 13 


P = 25 


BVAR 






EMPL 


0.3333 


0.3336 


0.3338 


0.3341 


0.3335 


0.46 


h 


= 1 


CPI 


0.3623 


0.3618 


0.3613 


0.3621 


0.3623 


0.50 






FFR 


0.4279 


0.4281 


0.4281 


0.4284 


0.4287 


0.75 






EMPL 


0.5191 


0.5188 


0.5192 


0.5191 


0.5189 


0.38 


h 


= 3 


CPI 


0.4990 


0.4992 


0.4986 


0.4995 


0.4996 


0.40 






FFR 


0.4615 


0.4614 


0.4619 


0.4617 


0.4628 


0.94 






EMPL 


0.4730 


0.4730 


0.4735 


0.4729 


0.4736 


0.50 


h 


= 6 


CPI 


0.4880 


0.4874 


0.4884 


0.4885 


0.4891 


0.40 






FFR 


0.5237 


0.5242 


0.5243 


0.5243 


0.5250 


1.29 






EMPL 


0.4997 


0.4991 


0.4992 


0.4997 


0.5002 


0.78 


h-- 


= 12 


CPI 


0.4689 


0.4687 


0.4689 


0.4694 


0.4686 


0.44 






FFR 


0.4201 


0.4199 


0.4201 


0.4200 


0.4216 


1.93 



Table 1: RMSFE w.r.t. different choices of h and P. 



variables' lags are used as regressors. As discussed earlier, because of the stationary requirement of 
our method, the series are transformed to obtain stationarity so that many of the series are (2nd 
order) differences of the raw data series (or logarithm of the raw series). 

We evaluate the forecast performance over the period from Tq = January 70 to Ti = December 
03 and for forecast horizons up to one year {h = 1,3,6,12). The order of the VAR is set to be 
P = 1, 4, 7, 13, 25. The resulting performance is summarized in Table [T] with comparisions to the ones 



of Bahbura et al. (2010) listed under the "BVAR" column. As we can see, unlike the information 
criteria based on lag selection techniques, the RMSFE is very robust to the initial choice of P, which 
primarily benefits from the "re-weighting over lags" technique (p~") we used before. For this specific 
data set, P = 1 seems enough. But in general, since we never know the true value of lags, we can 
include a large enough P at the beginning to allow flexibility without worrying about over fitting. 
Moreover, for the one-step-ahead forecast, our method outperforms for EMPL, CPI and FFR, while 
when /i ^ 3, it outperforms mainly for EMPL and FFR, especially for the latter one. This results 
from the fact that different time series might have quite different behaviors, so if we just have the 



"universal" penalty parameter for all of them as in Bahbura et al. (2010 ), the corresponding forecasting 



performance might not be optimized. For reference purpose, we also provide the factor-augmented 
vector autoregressive results of Bernanke et al. (2005) in Figure [3] 
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FAVAR 1 factor 






FAVAR 3 factors 




LARGE 
BVAR 


p = 13 


p = BIC 


BVAR 


p = 13 


p = BIC 


BVAR 


h= 1 


EMPL 


1.36 


0.54 


0.70 


3.02 


0.52 


0.65 


0.46 




CPI 


1.10 


0.57 


0.65 


2.39 


0.52 


0.58 


0.50 




FFR 


1.86 


0.98 


0.89 


2.40 


0.97 


0.85 


0.75 


h = 3 


EMPL 


1.13 


0.55 


0.68 


2.11 


0.50 


0.61 


0.38 




CPI 


0.80 


0.49 


0.55 


1.44 


0.44 


0.49 


0.40 




FFR 


1.62 


1.12 


1.03 


3.08 


1.16 


0.99 


0.94 


h = 6 


EMPL 


1.33 


0.73 


0.87 


2.52 


0.63 


0.77 


0.50 




CPI 


0.74 


0.52 


0.55 


1.18 


0.46 


0.50 


0.40 




FFR 


2.07 


1.31 


1.40 


3.28 


1.45 


1.27 


1.29 


h=n 


EMPL 


1.15 


0.98 


0.92 


3.16 


0.84 


0.83 


0.78 




CPI 


0.95 


0.58 


0.70 


1.98 


0.54 


0.64 


0.44 




FFR 


2.69 


1.43 


1.93 


7.09 


1.46 


1.69 


1.93 



Figure 3: Results of Bernanke et al. (2005) and Banbura et al. (2010) 



5 Concluding Remarks and Discussions 

To summarize, in this article, we first show that under the time series setup, if we still use the classic 
Lasso type estimator, the risk bound will increase when the time dependence level increases, however, 
our method could still achieve the consistency of variable selection under such scenario; second, our 
method is able to do variable selection and lag selection simultaneously, and is rather robust to the 
initial choice of lags; third, we allow individualized weights between own and others' lags. All these 
have been confirmed by the real forecasting performance in the previous section and come at a low 
computational cost. 

Some issues we do not explore here include nonstationarity, rank test, cointegration and causal 
test. For a typical macroeconomic data set, the nonstationarity comes from seasonality, business cycle 



and economic developments. In spirit of Song et al. (2010), this motivates us to add a nonstationary 
component UT to equation ^ as below 

Yt = ZtT + XtB + Ut, 

where Zt = {Zi{t), . . . ,Zf({t))~^ contains R basis functions of time consisting of Fourier series with 
different frequencies and segment by segment ortho-normal polynomials with corresponding R x J 
coefficient matrix F, and Xj, B are the same as in equation ([T|. Studying this extended model deserves 
further investigation and will be presented in a separate paper. If we want to consider the rank test, 
cointegration and causal test, what we need for this high dimensional time series is not the ones in 
the univariate case, but the high dimensional simultaneous tests, which might be much more difficult. 
Heteroscedasticity with Cross-section Correlations. 

We consider Coy{Ut) = S with nonzero off-diagonal entries in S. Assume that we have a consis- 
tent estimate S for S (which is another challenging task since S is a J x J matrix) with Cholesky 
decomposition S = C^C, where C is an upper triangular matrix with inverse D (which is also an 
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upper triangular matrix). Without loss of generality, assume all diagonal entries of S,C and D are 
equal to 1. Transform the original Xt by D to generate Xt {Xt = XtD) s.t. Coy{UtD) = I. Under 
this situation, we are no longer selecting the original variables, but linear transformations of them. 
Thus we must show that this does not affect the inference. We have 

+ ^2Xt2 + ■■■ + PjXtJ 

J-1 

=hxtl + P2{dl2Xtl + Xt2) + . . . + PjC^ djjXtj + Xtj) 

i=i 

J J 

={h + /3jdij)xti + (^2 + (3jd2j)xt2 + ...+ PjXtj. 

3=2 j=3 

If the off-diagonal entries of D, {c?jj}i<j, are much smaller than the diagonal entries 1, it is likely that 
the selected nonzero sets of /3's (or c, d) are the same as the selected nonzero sets 5*1 and 5*2 of /3's, 



which have been shown to be the same as the oracle ones. Theorem |3.2| and |3.3[ By the definition 
of D, this means that the off-diagonal entries of C, S and S should also be much smaller than their 
diagonal entries 1, e.g. the cross-section correlations must be weak enough, which aligns with the case 



for dynamic factor models, Forni et al. (2000). 
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6 Appendix 



Proof of Theorem |3.1| The proof of this theorem is based on the ones of Lemma 3.1 and Theorem 



3.1 in Lounici et al. (2009) up to a modification of the bound on P(^'^) with random event A 



I maxi^p^P Ylt=i ^tXtp ^ At| , where n, M and T there are equivalent to T, P and 1 here respectively. 



The intermediate results in the proof of Theorem 3.1 in Lounici et al. (2009) show that 

T-^\\ X{B~B*) f ^ 16sAV«;^ 
II i? - fi* II ^ 16 ^ 16sA/k^ 
M(5) ^ em^^s/n'. 



(20) 
(21) 
(22) 
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We have: 



1 1 1 

V{A') = P ( max^5^£iXtp > at) ^ PP (^e^X^p > At) = PP (^1^^ > AT 

IsgpsgP 



Then, by the (extended) Mcdiarmid inequahty, see Theorem 2.1 of Janson (2004), with random 
vectors {Vt}'^^i, we have 



T . 

V{A^) ^ PP (5^Vt > at) ^ Pexp <^ - 
t=i ^ 



A^T 



^ P 



with A = ^X*{r){\ogPY+^'C' /T, 5' > 0, which, together with (|20|), (|2T[ and (g, leads to (|T4|) 



(15) and (16) 



□ 



Proof of Lemma 3.1 The proofs are closely built upon those of Wang et al. (2007). Let S 



{U^,v'^y, U = {Ui,..., Mp(j_i))^, V 


= {vi,..., vpY, ar = T-V2 + and + otS : \\ 


6\\ ^ e} be 


the ball around Then, for \\6\\ = 


e, we have 




Dt{6) = Qt{/3* + ard) - Qrin 






= Lt{(3* + ar^) - Lt{(3*) 


ieSi jeS2 


-\\d*\\) 


= Lt{(3* + ar^) - Lt{I3*) 


— Tax Aj — Tax Ijl'^jl 




= Lt{(3* + ar^) - Lt{(3*) 


— Ta^poe — Ta^qoe 




= Lt{^* + arS) - Lt{(3*) 


- TaKpo + qo)e. 


(23) 



Furthermore, 



Lt{(3* + axS) — Lt{/3*) = ''^^{ut — arzjv — aru^ XtY — ''^^u^ 

t t 

t 

—2aT Ufzjv 
t 

+2a^ zJvu^Xf. 



(24) 
(25) 
(26) 



By employing the martingale central limit theorem and the ergodic theorem, we can show that (24) 



= Ta^{5^S5 + Op(l)}, (|25|) = 5^C»p(Ta^) and (|26|) = Ta^Op(l) = Op(Ta|). Because (|24|) dominates 
the terms (25), (26) and Taf.(po + 9o)e in equation (23), for any given e > 0, there is a large constant 



e such that 



P[ inf Qt{P* + arS) > QtW*)] > I - e. 

\\5\\=e 
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This implies that, with probabihty at least 1 — e, there is a local minimizer in the ball + '■ 
^ e}, [Bickel et al. (1998) and Fan and Li (2001). Consequently, there is a local minimizer of 



Qt{,P) such that 



= Opi^ar)- This completes the proof. 



□ 



Proof of Theorem 3.2 The proof is essentially the same as those of Theorem 2 of Zou (2006) and 



Wang et al. (2007). For i G S{, assume that there is a local minimizer (3 with q 7^ 0. By the KKT 
optimality condition, we have 







Ci 

dLT{/3* 

Ci 



+ TXisgn{ci) 
+ rS,(/3 - /3*){1 + Opil)} + T\sgn{ci), 



(27) 



where Ej denotes the zth row of S and i G S^. By employing the central limit theorem, the first term 
in equation (27) is of order (9p(T^/^). Furthermore, the condition in Theorem 



3.2 



implies that its 

second term is also of order OpiT^^'^). Both are dominated by TAj since hT\/T 00. Therefore, the 



sign of equation (27) is dominated by the sign of q. Thus (27) can not be equal to 0. Consequently, 



we must have q = in probability. Analogously, we can show that ^{ds^ = 0) — )■ 1. This completes 
the proof. □ 

we have P(/32 = 0) — )■ 1. Hence, the 



Proof of Theorem 



3.3 



Applying Lemma 



3.1 



3.2 



and Theorem 

minimizer of Qt{P) is the same as that of Qt{Pi) with probability tending to 1. This implies that 
the estimator /3i satisfies the equation 



(3i 



0. 



/3i=/3i 



(28) 



According to Lemma 
yields 



3.1 







(3i is V T-consistent. Thus, the Taylor series expansion of equation (28) 

1 dLT0i' 



T (3i 
1 dLrm 



+ 9W1WT 

+ g{f3l)Vf + Sov^(/3i - 131) + Op{l), 



'T (3i 

where g is the first-order derivative of the penalty function 

^ Ai|ci| + ^ij\dj\, 

ieSi jeS2 

and g0i) = g{(3l) when T is sufficiently large. Furthermore, it can be easily shown that g{(3l)\/T 
c'p(l), which implies that 



'T d/3i 
4 N(0,So^). 
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The next step is to show P{Si = Si) — )■ 1 and P(S'2 = 5'2) — > 1. \/i E Si and p E S2, the asymptotic 
normahty resuh indicates that q — ?■ c* and dp — > d*, where — > stands for convergence in probabihty. 
Thus P{i eSi) ^1 and P{p e S2) 1. It suffices to show that W ^ Si and p' ^ S2, P(i' E Si) ^ 



and P{p' G S2) — )■ 0, which have been shown by Theorem 3.2 This completes the proof. □ 
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